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ABSTRACT 

Data distribution, degree of data replication, and transaction 
access patterns are key Factors in determining the performance 
of distributed database systems. In order to simplify the evalua- 
tion of performance measures, database designers and researchers 
tend to make simplistic assumptions about the system. In this 
paper, we investigate the effect of modeling assumptions on the 
evaluation of one such measure, the number of transaction roll- 
backs, in a partitioned distributed database system. We develop 
six probabilistic models and develop expressions for the number 
of rollbacks under each of these models. Essentially, the models 
differ in terms of the available system information. The analyti- 
cal results so obtained are compared to results from simulation. 
From here, we conclude that most of the probabilistic models 
yield overly conservative estimates of the number of rollbacks. 
The effect of transaction commutativity on system throughput is 
also grossly undermined when such models are employed. 

1. INTRODUCTION 

A distributed database system is a collection of cooperating 
nodes each containing a set of data items (In this paper, the 
basic unit of access in a database is referred to as a data item.). 
A user transaction can enter such a system at any of these nodes. 
The receiving node, sometimes referred to as the coordinating or 
initiating node, undertakes the task of locating the nodes that 
contain the data items required by a transaction. 

A partitioning of a distributed database (DDB) occurs when 
the nodes in the network split into groups of communicating 
nodes due to node or communication link failures. The nodes 
in each group can communicate with each other, but no node in 
one group is able to communicate with nodes in other groups. We 
refer to each such group as a partition. The algorithms which al- 
low a partitioned DDB to continue functioning generally fall into 
one of two classes [Davidson et al. 1985]. Those in the first class 
take a pessimistic approach and process only those transactions 
in a partition which do not conflict with transactions in other par- 
titions, assuring mutual consistency of data when partitions are 
reunited. The algorithms in the second class allow every group 
of nodes in a partitioned DDB to perform new updates. Since 
this may result in independent updates to items in different par- 
titions, conflicts among transactions are bound to occur, and the 
databases of the partitions will clearly diverge. Therefore, they 

require a strategy for conflict detection and resolution. Usually, 
rollbacks are used as a means for preserving consistency; con- 
flicting transactions are rolled back when partitions are reunited. 
Since coordinating the undoing of transactions is a very difficult 
task, these methods are called optimistic since they are useful 
primarily in a situation where the number of items in a par- 
ticular database is large and the probability of conflicts among 
transactions is small. 

In general, determining if a transaction that successfully ex- 
ecuted in a partition is rolled back at the time the database 
is merged depends on a number of factors. Data items in the 
read-set and the write-set of the transaction, the distribution of 
these data items among the other partitions, access patterns of 
transactions in other partitions, data dependencies among the 
transactions, and semantic relation (if any) between these trans- 
actions are some examples of these factors. Exact evaluation of 


rollback probability for all transactions in a database (and hence 
the evaluation of the number of rolled back transactions) gen- 
erally involves both analysis and simulation, and requires large 
execution times [Davidson 1982; Davidson 1984]. . To overcome 
the computational complexities of evaluation, designers and re- 
searchers generally resort to approximation techniques [David- 
son 1982; Davidson 1986; Wright 1983a; Wright 1983b]. These 
techniques reduce the computation time by making simplifying 
assumptions to represent the underlying distributed system. The 
time complexity of the resulting techniques greatly depends on 
the assumed model as well as evaluation techniques. 

In this paper we are interested in determining the cfTcct of the 
distributed database models on the computational complexity 
and accuracy of the rollback statistics in a partit ioned database. 

The balance of this paper is outlined as follows. Section 2 for- 
mally defines the problem under consideration. In Section 3, we 
discuss the data distribution, replication, and transaction model- 
ing. Section 4 derives the rollback statistics for one distribution 
model. In Section 5, we compare the analysis methods for six 
models and simulation method for one mode! based on computa- 
tional complexity, space complexity, and accuracy of the measure. 
Finally, in Section 6, we summarize the obtained results. 

2. PROBLEM DESCRIPTION 

Even though a transaction 1\ in partition l\ may be rolled 
back (at merging time) by another transaction V 2 in partition Pj 
due to a number of reasons, the following two cases arc found to 
be the major contributors [Davidson 1982]. 

i p 1 ^ P 2l and there is at least one data item which is up- 

* dated by both T x and T 2 . This is referred to as a write-write 
conflict. 

ii. P, = P 2 , T 2 is rolled back, and it is a dependency parent of 
T x (i.e., T x has read at least one data item updated by T 2 , 
and T 2 occurs prior to 7\ in the serialization sequence). 

The above discussion on reasons for rollback only considers 
the syntax of transactions (i.e. read- and write-sets) and does 
not recognize any semantic relation between them. To be more 
specific, let us consider transactions T x and T 2 executed in two 
different partitions Pi and P 2 respectively. Let us also assume 
that the intersection between the write- sets of T x and T 2 is non- 
empty. Clearly, by the above definition, there is a write-write 
conflict and one of the two transactions has to be rolled back. 
However, if T\ and T 2 commute with each other, then there is no 
need to rollback either of the transactions at the time of partition 
merge [Garcia-Molina 1983; Jajodia and Speckman 1985; Jajodia 
and Mukkamala 1990]. Instead, T x needs to be executed in P 2 
and T 2 needs to be executed in P x . The analysis tn this paper 
take this property into account. 

In order to compute the number of rollbacks, it is also nec- 
essary to define some ordering (O(P)) on the partitions. For 
example, if T, and T 2 correspond to case (i) above, and do not 
commute, it is necessary to determine which of these two are 
rolled back at the time of merging. Partition ordering resolves 
this ambiguity by the following rule: Whenever two conflicting 
but non-commuting transactions are executed in two different 
partitions, then the transaction executed in the lower order par- 
tition is rolled back. 
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Since a transaction may be rolled back due to either (i) or 
(u), we classify the rollbacks into two classes: Class 1 and Class 

ba£ P !?rh y t- The f problem of estimating the number of roll- 
‘'"J* of Partition merging in a partially replicated 
distributed database system may be formulated as follows 

roller 1 u, followin 8 parameters, determine the number of 
rolled back transactions in class 1 (/?,) and class 2 (/?,). 

• n, the number of nodes in the database; 

• d, the number of data items in the database; 


set. 

With these three parameters, we can describe a number of 
models. Due to the limited space, we chose to present the results 
for six of these models in this paper. 

We chose the following six models based on their applicability 
in the current literature, and their close resemblance to practical 
systems. In all these models, the rate of arrival of transactions 
at each of the nodes is assumed to be completely known a priori 
We also assume complete knowledge of the partitions (i.e. which 
nodes are in which partitions) in all the models. 


• p, the number of partitions in the distributed system (prior 

to merge); VK 

• f, the number of transaction types; 


Model 1: Among the six chosen models, this has the max- 
imum information about data distribution, replication, and 
transactions in the system. It captures the following infor- 
mation. 


• GD, the global data directory that contains the location of 
each of the d data items; the GD matrix has d rows and n 
columns, each of which is either 0 or 1; 


• NS kl the set of nodes in partition k, V* = 

• RSj , the read-set of transaction type j = l f 2 f . . . 1 1; 

• W'Sj, the write-set of transaction type j t j = 1 , 2 , . . . 


• tbe number of transactions of type j received in par 
tition k (prior to merge), j = 1, 2, . . . k = 1, 2, p 


CM y the commutativity matrix that defines transaction 
commutativity. If CM ith true then transaction types 
and commute. Otherwise they do not commute. 


The average number of total rollbacks is now expressed as R = 

+ /tj. 


3. MODEL DESCRIPTION 

As staled in the introduction, the primary objective of this 
paper is to investigate the effect of data distribution, replication 
and transaction models on estimation of the number of rollbacks 
in a distributed database system. 

To describe a data distribution-transaction model, we char- 
acterize it with three orthogonal parameters: 

1* degree of data item replication (or the number of copies). 

2. Distribution of data item copies. 

3. Transaction characterization 

We now discuss each of these parameters in detail 

For simplicity, several analysis techniques assume that each 
data item has the same number of copies (or degree of replica- 
tion) in the database system (Coffman et a). 1981 . Some other 
techniques characterize the degree of replication of a database by 
the average degree of replication of data items in that database 
[Davidson 1986]. Others treat the degree of replication of each 
data item independently. 

Some designers and analysts assume some specific allocation 
schemes for data item (or group) copies (e.g., [Mukkamala 1987]). 
Assuming complete knowledge of data copy distribution (GD) 
is one such assumption. Depending on the type of allocation, 
such assumptions may simplify the performance analysis. Others 
assume that each data item copy is randomly distributed among 
the nodes in the distributed system (Davidson 1986]. 

Many database analysts characterize a transaction by the size 
of its read-set and its write-set. Since different transactions may 
have different sizes, these are either classified based on the sizes, 
or an average read-set size and average write-set size are used to 
represent a transaction. Others, however, classify transactions 
based on the data items that they access (and not necessarily on 
their size). In this case, transaction types are identified with their 
expected sizes and the group of data items from which these are 
accessed. An extreme example is a case where each transaction in 
the system is identified completely by its read-set and its write- 


• Replication: Data replication is specified for each data 
item. 

• Data distribution: The distribution of data items among 
the nodes in the system is represented as a distribution 
matrix (as described in Section 2). 

• Transactions: All distinct transactions executed in a 
system are represented by their read-sets and write- 
sets. Thus, for a given transaction, the model knows 
which data items are read, and which data items are 

updated The commutativity information is also com- 
pletely known and is expressed as a matrix (as de- 
scribed m Section 2). 

Model 2: This model reduces the number of transactions 
by combining them into a set of transaction types based on 
commutativity, commonalities in data access patterns, etc. 
Since the transactions are now grouped, some of the indi- 
vidual characteristics of transactions (e.g. the exact read- 
set and writes-set) are lost. This model has the following 
information. ° 

• Replication: Average degree of replication is specified 
at the system level. 

• Data distribution: Since the read- and write-set infor- 
mation is not retained for each transaction type, the 
data distribution information is also summarized in 
terms of average data items. It is assumed that the 
data copies are allocated randomly to the nodes in the 
system. 

• Transactions: A transaction type is represented by 
its read-set size, write-set size, and the number of 
data items from which selection for read and write 
is made. Since two transaction types might access the 
same data item, it also stores this overlap information 
for every pair of transaction types. The commutativ- 
ity information is stored for each pair of transaction 
types. 

Model 3: This model further reduce the transaction types 
by grouping them based only on commutativity character- 
istics. No consideration is given to commonalities in data 
access pattern or differing read-set and write-set sizes. It 
has the following information. 

• Replication: Average degree of replication is specified 
at the system level. 

• Data distribution: As in model 2, it is assumed that 
the data copies are allocated randomly to the nodes 
in the system. 

• Transactions: A transaction type is represented by 
the average read-set size and average write-set size. 
The commutativity information is stored for all pairs 
of transaction types. 

Model 4 : This model classifies transactions into three 
types, read-only, read-write, and others. Read-only trans- 
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actions commute among themselves. Read-write transac- 
tions neither commute among themselves nor commute with 
others. The others class corresponds to update transactions 
that may or may not commute with transactions in their 
own class. This fact is represented by a commute probabil- 
ity assigned to it. 

• Replication: Average degree of replication is specified 
at the system level. 

• Data distribution: As in model 2, it is assumed that 
the data copies are allocated randomly to the nodes 
in the system. 

• Transactions: Read-only class is represented by aver- 
age read-set size. The read- write class is represented 
by average read-set and write-set sizes. The others 
class is represented by the average read-set size, aver- 
age write-set size and the probability of commutation. 

Model 5: This model reduces the transactions to two 
classes: read-only and read- write. Read-only transactions 
commute among themselves. The read-write transactions 
corresponds to update transactions that may or may not 
commute with transactions in their own class. This fact is 
represented by a commute probability assigned to it. 

• Replication: Average degree of replication is specified 
at the system level. 

• Data distribution: As in model 2, it is assumed that 
the data copies are allocated randomly to the nodes 
in the system. 

• Transactions: Read-only class is represented by aver- 
age read-set size. The reAd- write class is represented 
by average read-set and write-set sizes, and the prob- 
ability of commutation. 

Model 6: This model identifies read-only transactions and 
other update transactions. But these two types have the 
same average read-set size. Update transactions may or 
may not commute with other update transactions. 

• Replication: Average degree of replication is specified 
at the system level. 

• Data distnbution: As in model 2, it is assumed that 
the data copies are allocated randomly to the nodes 
in the system. 

• Transactions; The read-set size of a transaction is de- 
noted by its average. For update transactions, we also 
associate an average write-set size and the probability 
of commutation. 

Among these, model 1 is very general, and assumes complete 
information of data distribution ( GD ), replication, and transac- 
tions. Other models assume only partial (or average^ information 
about data distribution and replication. Model 1 has the most 
information and model 6 has the least. 

4. COMPUTATION OF THE AVERAGES 

Several approaches offer potential for computing the average 
number of rollbacks for a given system environment; the most 
prominent methods are simulation and probabilistic analysis. 

Using simulation, one can generate the data distribution ma- 
trix (GD) based on the data distribution and replication policies 
of the given model. Similarly, one can generate different trans- 
actions (of different types) that can be received at the nodes in 
the network. Since the partition information is completely spec- 
ified, by searching the relevant columns of the GD matrix, it is 
possible to determine whether a given transaction has been suc- 
cessfully executed in a given partition. Once all the successful 
transactions have been identified, and their data dependencies 
are identified, it is possible to identify the transactions that need 
to be rolled back at the time of merging. The generation and 
evaluation process may have to be repeated enough number of 
times to get the required confidence in the final result. 


Probabilistic analysis is especially useful when interest is con- 
fined to deriving the average behavior of a system from a given 
model. Generally, it requires less computation time. In this pa- 
per, we present detailed analysis for model 6, and a summary of 
the analysis for models 1-5. 

4.1 Derivations for Model 6 

This model considers only two transaction types: read-only 
(Type 1) and read-write (Type 2). Both have the same average 
read-set size of r. A read-write transaction updates w of the data 
items that it reads. N\ k and /V u represent the rate of arrival of 

types I and 2 respectively at partition k. The average degree 
of replication of a data item is given as c. The system has n 
nodes and d data items. The probability that two read-write 
transaction commute is m. 

Let us consider an arbitrary transaction T\ received at one 
of the nodes in partition k with n* nodes. Since the copies of 
a data item are randomly distributed among the n nodes, the 
probability that a single data item is accessible in partition k is 
given by 



Since each data item is independently allocated, the expected 
number of data items available in this partition is da*. Similarly, 
since T\ accesses r data items (on the average), the probability 
that it will be successfully executed is aj. From here, the number 
of successful transactions in k is estimated as aJA^u an d a£A/ 2 * 
for types 1 and respectively. 

In computing the probability of rollback of T\ due to case (i), 
we are only interested in transactions that update a data item in 
the write-set of T\ and not commuting with TV The probability 
that a given data item (updated by T t ) is not updated in another 
partition k! by a non-commuting transaction (with respect to Tj) 
is given by 




( 2 ) 


Given that a data item is available in k, probability that it is 
not available in k ' is given as 


7 (*.*') 


' ~>C) ‘ 




(3) 


From here, the probability that a data item available in k is not 
updated any other transaction in higher order partitions is given 
as 


6 k = n (7(M') + (l-7 W 


The probability that transaction T t is not in write-write con- 
flict with any other non-commuting transaction of higher-order 
partitions is now given as 





From here, the number of transactions rolled back due to category 
(i) may be expressed as Ry = H*»i(l — 

To compute the rollbacks of category (ii), we need to deter- 
mine the probability that T\ is rolled back due to the rollback of 
a dependency parent in the same partition. If T? is a read- write 
transaction in partition k , then the probability that T\ depends 
on Tj (i.e. read- write conflict) is given by: 
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The probability that 7 1 is not rolled back due to the roll back of 
any of its dependency parents is now given by: 


X* 


E 


I3| 


(Xfc/ifc + 1 - x k ) ut 


(7) 


• Model 3, in addition to the space required by models 4- 
6> a]so requires 0(t 2 ) for commutativity matrix. Thus it 
requires 0(nt + f 2 ) space. 

• Model 2, in addition to the space required by model 3, 
also requires t space to store the data overalp information' 
Ihus, it requires 0(nt 4- Z 2 ) storage. 

Thus model l has the largest storage requirement and model 6 
has the least. 


where N k = N lk + N 7k and xi = N 2k j{N xk + N lk ). 

The total number of rolled back transactions due to category 
(ii) is now estimated as R 7 = £J=i(l - XkWMk + » k N 2k ). The 
total number of rolled back transactions is n = /Zj + R 7 . 

5. COMPARISON OF THE MODELS 

As mentioned in the introduction, the main objective of this 
paper is to determine the effect of data distribution, replication, 
and transaction models on the estimation of rollbacks. To achieve 
this, we evaluate the desired measure using six different data 
distribution and replication models. The comparison of these 
evaluations is based on computational time, storage requirement, 
and the average values obtained. 

Due to the limited space, wc could not present the detailed 
derivations for the average values for models 2-6. The final ex- 
pressions, however, are presented in [Mukkamala 1990]. 

5.1 Computational Complexity 

We now analyze each of the evaluation methods (for models 
1-6) for their computational complexity. 

• In model 1, all t transactions are completely specified, and 
the data distribution matrix is also known. To determine 
if a transaction is successful, we need to the scan the dis- 
tribution matrix. Similarly, determining if a transaction in 
a lower order partition is to be rolled back due to a write- 
write conflict with a transaction of higher order partition 
requires comparison of write-sets of the two transactions. 
Determining if a transaction needs to be rolled back due to 
the rollback of a dependency parent also requires a search. 
All this requires 0(ndt + p 2 t 2 + ;d 2 /V), where t is the num- 
ber of transaction types and ;V is the maximum number of 
transactions executed in a partition prior to the merge. 

• Models 2-6 have a similar computation structure. The num- 
ber of transaction types (t) is high for model 2 and low for 
model 6. Each of these models require 0(p 2 t 7 c + pt 2 N) 
time. As before, t is the number of transaction types and 
N is the maximum number of transactions executed in a 
partition prior to the merge. 

Thus, model 1 is the most complex (computationally) and model 
6 is the least complex. 

5.2 Space Complexity 

We now discuss the space complexity of the six evaluation 
methods: 

• Model 1 requires O(dn) to store the data distribution ma- 
trix, O(n) to store the partition information, 0(dt ) to store 
the data access information, and O(nt) to store the trans- 
action arrival information. It also requires 0(< J ) to store 
the commutativity information. Thus, it requires 0(dn + 
dt + nf + t 2 ) space to store model information. 

• Models 4-6 require similar information: O(t) to store the 
average size of read- and write- sets of transaction types, 
O(nf) for transaction arrival, 0(n ) for partition informa- 
tion, and 0(f) for commute information. Thus they require 
O(nf) space. 


5.3 Evaluation of the Averages 

In order to compare the effect of each of these models on 
the evaluation of the average rollbacks, we have run a number of 
experiments. In addition to the analytical evaluations for models 
1-6, we have also run simulations with Model 1. The results 
from these runs are summarized in Tables 1-7. Basically these 
tables describe the number of transactions successfully executed 
before partition merge (Before Merge), number of rollbacks due 
to class 1 (ni), rollbacks due to class 2 (R 7 ), and transactions 
considered to be successful at the completion of merge ( After 
Merge), Obviously, the last term is computed from the earlier 
three terms. In all these tables, the total number of transaction 
arrivals into the system during partitioning is taken to be 65000. 
Also, each node is assumed to receive equal share of the incoming 
transactions. 

• Table 1 summarizes the elTect of number of partitions as 
measured with Models 1-6. Here, it is assumed that each 
of the data items in the system has exactly c - 3 copies. 
The other assumptions in models 1-6 are as follows: 

1* Model 1 considers 130 transaction types in the sys- 
tem. Each is described by its read- and write-sets and 
whether it commutes with the other transactions. 90 
of the 130 are read-only transactions. The rest of the 
40 are read- write. Among the read- write, 15 commute 
with each other, another 10 commute with each other, 
and the rest of the 1 5 do not commute at all. The sim- 
ulation run takes the same inputs but evaluates the 
averages by simulation. 

2. Model 2 maps the 130 transaction types into 4 classes. 
To make the comparisons simple, the above four classes 
(90+15+10-H5) are taken as four types. The data 

overlap is computed from the information provided in 
model 1. 

3. Model 3, to facilitate comparison of results, considers 
the above 4 classes. This model, however, does not 
capture the data overlap information. 

4. Model 4 considers three types: read-only, read- write 
that commute among themselves with some probabil- 
ity, and read- write that do not commute at all. 

5. Model 5 considers read-only transactions with read-set 
size of 3 and read-write transactions with read-set size 
of 6. Read-write transactions commute with a given 
probability. 

6. Model 6 only considers the average read-set size (com- 
puted as 4 in our case), the portion of read-write trans- 
actions (=45/130), and the average write-set size for 
a read-write (= 2). Probability that any two transac- 
tions commute is taken to be 0.4. 

From Table 1 it may be observed that: 

• The analytical results from analysis of Model 1 is a 
close approximation of the ones from simulation. 

• The evaluation of number of successful transactions 
prior to the merge is well approximated by all the 
models. Model 6 deviated the most. 

• The difference in estimations of R x and R 7 is signif- 
icant across the models. Model 1 is closest to the 
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simulation. Model 6 has the worst accuracy. Model 
5, surprisingly, is somewhat better than Models 2,3,4, 
and 6. 

• The estimation of R 2 from models 2-6 is about 50 
times of the estimation from Model 1. The estima- 
tions from Model 1 and the simulation are quite close. 
From here, we can see that, Models 2-6 yield overly 
conservative estimates of the number of rollbacks at 
the time of partition merge. While Model 1 estimated 
the rollbacks as 1200, Model 2-6 have approximated 
them as about 13000. 

• This difference in estimations seems to exist even when 
the number of partitions is increased. 

• Table 2 summarizes the effect of number of copies on the 
evaluation accuracies of the models. It may be observed 
that 

• The difference between evaluations from Model 1 and 
the others is significant at low (c = 3) as well as high 
(c = 8) values of c. Clearly, the difference is more 
significant at high degrees of replication. 

• The case pi = 4,p? = 6,c = 8 corresponds to a case 
where each of the 500 data items is available in both 
the partitions. This is also evident from the fact that 
all the 65000 input transactions are successful prior to 
the merge. 

• The results from the analysis and simulation of Model 
1 are close to those from simulation. 

• Table 3 shows the effect of increasing the number of nodes 
from 10 (in Table 1) to 20. For large values of n, all the six 
models result in good approximations of successful trans- 
actions prior to merge. The differences in estimations of /?j 
and R 2 still persist. 

• Table 4 compares models 5 and 6. While model 6 only re- 
tains average read-set size information for any transaction, 
model 6 keeps this information for read-only and read- write 
transactions separately. This additional information en- 
abled model 5 to arrive at better approximations for R\ 
and R 2 . In addition, the effect of commutativity on Ri and 
R 2 is not evident until m > 0.99. This is counterintuitive. 
The simplistic nature of the models is the real cause of this 
observation. Thus, even though these models have resulted 
in conservative estimates of H l and R 2 , we can‘t draw any 
positive conclusions about the effect of commutativity on 
the system throughput, 

• The comments that were made about the conservative na- 
ture of the estimates from models 5 and 6 also applies to 
model 2. These results are summarized in Table 5. Even 
though this model has much more system information than 
models 5 and 6, the results (/?, and R 2 ) are not very differ- 
ent. However, the effect of commutativity can now be seen 
at m > 0.95. 

• Having observed that the effect of commutativity is almost 
lost for smaller values of m in models 2-6, we will now look 
at its effect with model 1. These results are summarized 
in Table 6. Even at small values of m, the effect of com- 
mutativity on the throughput is evident. In addition, it 
increases with m. This observation holds at both small 
and large values of c. 

• In Table 7, we summarize the effect of variations in num- 
ber of copies. In Tables 1-6, we assumed that each data 
item has exactly the same number of copies. This is more 
relevant to Model 1. I hus we only consider this model in 
determining the effect of copy variations on evaluation of R\ 
and R 2 . As shown in this table, the effect is significant. As 
the variation in number of copies is increased, the number 
of successful transactions prior to merge decreases. Hence, 
the number of conflicts are also reduced. This results in 


a reduction of Ry and R 2 . AS long as the variations are 
not very significant, the differences are also not significant. 

6. CONCLUSIONS 

In this paper, we have introduced the problem of estimating 
the number of rollbacks in a partitioned distributed database sys- 
tem. We have also introduced the concept of transaction commu- 
tativity and described its effect on transaction rollbacks. For this 
purpose, the data distribution, replication, and transaction char- 
acterization aspects of distributed database systems have been 
modeled with three parameters. We have investigated the effect 
of six distinct models on the evaluation of the chosen metric. 
These investigations have resulted in some very interesting ob- 
servations. This study involved developing analytical equations 
for the averages, and evaluating them for a range of parameters. 
We also used simulation for one of these models. Due to lack 
of space, we could not present all the obtained results in this 
paper. In this section, we will summarize our conclusions from 
these investigations. 

We now summarize these conclusions. 

• Random data models that assume only average information 
about the system result in very conservative estimates of 
system throughput. One has to be very cautious in inter- 
preting these results. 

• Adding more system information does not necessarily lead 
to better approximations. In this paper, the system infor- 
mation is increased from model 6 to model 2. Even though 
this increases the computational complexity, it does not 
result in any significant improvement in the estimation of 
number of rollbacks. 

• Model 1 represents a specific system. Here, we define the 
transactions completely. Thus it is closer to a real-life sit- 
uation. Results (analytical or simulation) obtained from 
this model represent actual behavior of the specified sys- 
tem. However, results obtained from such a model are too 
specific, and can't be extended for other systems. 

• Transaction commutativity appears to significantly reduce 
transaction rollbacks in a partitioned distributed database 
system. This fact is only evident from the analysis of model 
1. On the other hand, when we look at models 2-6, it is 
possible to conclude that commutativity is not helpful un- 
less it is very very high. Thus, conclusions from model 1 
and models 2-6 appear to be contradictory. Since mod- 
els 3-6 assume average transactions that can randomly se- 
lect any data item to read (or write), the evaluations from 
these models art* likely to predict higher conflicts and hence 
more rollbacks. The benefits due to commutativity seem to 
disappear in the average behavior. Model 1, on the other 

hand, describes a specific system, and hence can accurately 
compute the rollbacks. It is also able to predict the benefits 
due to commutativity more accurately. 

• The distribution of number of copies seems to affect the 
evaluations significantly. Thus, accurate modeling of this 
distribution is vital to evaluation of rollbacks. 

In addition to developing several system models and evalua- 
tion techniques for these models, this paper has one significant 
contribution to the modeling, simulation, and performance anal- 
ysis community. 

If an abstract system model with average information is 
employed to evaluate the effectiveness of a new technique 
or a new concept, then we should only expect conservative 
estimates of the effects. In other words, if the results from 
the average models are positive, then accept the results 
If these are negative, then repeat the analysis with a less 
abstracted model. Concepts/techniques that are not ap- 
propriate for an average system may still be applicable for 
some specific systems. 
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Table 1. Effect of Number of Partitions on Rollbacks 


Model 

If 

Pi 

= 

II 

Ci 

11 

3 

Pi = 

5 

n 

P3 = 3, 

c = 3 

Before 

Mrrgr 

Hi 

lb 

A ftcr 

Mrrgr 

Before 

Merge 

/?. 

Hi 

After 

Merge 

Sim. 

50200 

1000 

205 

48995 

.11450 

0 

0 

31450 

1 

50200 

1000 

199 

49001 

31450 

0 

0 

31450 

2 

48315 

3597 

10322 

34397 

27069 

3460 

8945 

14664 

3 

48315 

3464 

10194 

34657 

27069 

2798 

9410 

14861 

4 

48618 

3667 

10243 

34708 

27657 

3255 

9444 

14958 

5 

47276 

2679 

10238 

34360 

24207 

1507 

9106 

13594 

6 

46593 

3852 

8570 

34171 

22356 

2937 

6673 

12747 


Table 2. Effect of Number of Copies on Rollbacks 



Pi 

= 4,pj 

= 6, c 
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Pi 
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II 

o> 

o 

II 

8 
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# 

Before 
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After 

Merge 

Before 

Merge 

Pi 

Pi 

After 

Merge 

Sim. 

34600 

200 

15 

34385 

65000 

4000 

4970 

56030 

1 

34600 

200 

0 

34400 

65000 

4000 

4981 

56019 

2 

31069 

1998 

5119 

23952 

65000 

8000 

17777 

39223 

3 

31069 

1601 

5334 

24134 

65000 

8000 

17786 

•39214 

4 

31595 

1798 

5420 

24377 

65000 

8000 

17786 

39214 

5 

23203 

1568 

2326 

19309 

65000 

8000 

17875 

39125 

6 

27138 

3413 

1701 

22024 

65000 

8000 

17860 

39140 


Table 3. Effect of Number of Nodes on Rollbacks 
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# 

Before 

Merge 

Ml 
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Merge 

Before 

Merge 

ft. 
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^Merge^ 
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I 

G 1 250 
Cl 250 

4000 

G240 

51010 

G5000 

5000 

6231 

53769 

4000 

6231 

51019 

65000 

5000 

6231 

53769 

2 
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9090 

21183 

30751 

65000 

10000 

22277 

32723 

3 

61024 

8992 

21286 

30746 

65000 

10000 

22286 

32714 

4 

61100 

9031 

21326 

30743 

65000 

10000 

22286 

32714 

5 

60968 

9064 

21292 

30613 

65000 

10000 

22375 

32625 

6 

60876 

9363 

20936 

30577 

65000 

10000 

22360 

32640 
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Effects of Distributed Database Modeling on Evaluation of Transaction Rollbacks 
Table 4. Effect of m on Rollbacks (Models 5 and 6: p, = 4,p, = 6,c = 3 


Model 5 


Model G 


m 

Before 

Merge 

R , 

R, 

After 

Merge 

Before 

Merge 


ih 

After 

Merge 

0.00 

47276 

2679 

10238 

34360 

46593 

3852 

8570 

34171 

0.50 

47276 

2679 

10238 

34360 

4C593 

3852 

8570 

34171 

0.80 

47276 

2679 

10238 

34360 

46593 

3852 

8570 

34171 

0.90 

47276 

2679 

10238 

34360 

46593 

3848 

8574 

34171 

0.95 

47276 

2678 

10239 

34360 

46593 

3774 

8774 

34175 

0.99 

47276 

2208 

10665 

34403 

46593 

2182 

10109 

34301 

1.00 

46726 

0 

0 

4672G 

46593 

0 

0 

46593 

Table 5. 

Effect of m on 

Rollback 

s (Model 

2: p, 

= = 

G) 



c = 

3 



c 

= 8 


m 

Before 

R i 

Ri 

After 

Before 


Ri 

After 


Merge 



Merge 

Merge 



Merge 

0.0 
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3597 

10322 

34397 

65000 
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17973 

39027 

0.27 

48315 

3597 

10322 

34397 

65000 

8000 

17973 

39027 

0.40 

48315 

3597 

10322 

34397 

65000 

8000 

17973 

39027 

0.77 

48315 

3597 

10322 

31397 

65000 

8000 

17973 

39027 

0.95 

48315 

3205 

[0708 

34402 

65000 

7660 

18312 

39028 

0.99 

48315 

980 

12882 

34447 

65000 

4321 

21642 

39037 

1.0 

48315 

0 

0 

48315 

65000 

0 

0 

65000 


Table 6. Effect of in on Rollbacks (Model 1: pi = 4,pj = 6) 
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0.0 
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50200 0 U 50200 

65000 

0 

0 
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0 

0 
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Table 7 . Effect of Variations in 

# of Copies on 

Rollbacks 
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/* This program creates a menu and facilitates updates, inserts and 
deletes of records in a database. EMPP is an employee database and 
the program assumes it to be already created with the following 

fields: 

EMPNO (employee number) of type numeric 
ENAME (employee name) of type character 

SAL (salary) of type numeric with provision for 2 places after 
decimal 

DEPTNO ( department number) of type numeric 
JOB (job name) of type character 

*/ 

#include <stdio.h> 

#include <ctype.h> 


EXEC SQL BEGIN DECLARE SECTION; 

VARCHAR uid[80]; /* variable for user id */ 

VARCHAR pwd [20] ; /* variable for password */ 

int empno; /* host variable for primary key - employee number */ 
VARCHAR ename[15] ; /* host variable for employee name */ 

int deptno; /* department number */ 

VARCHAR job[15] ; /* host variable for job */ 

int sal; /* host variable for salary */ 

1=0; /* host variable to hold the length of the srting - a 
value returned by the asks() function. */ 

int count; /* a variable to obtain number of records in the 
database with the same primary key value */ 

int reply =0; /* variable to obtain the whether a new value exists 
*/ 

int choice =0; /* variable defined to obtain value for the menu */ 

int code; /* variable to print to the ascii file to indicate wether 
the record was updated (value=l) , inserted (value=2) and deleted 
(value=3) */ 

EXEC SQL END DECLARE SECTION; 

EXEC SQL INCLUDE SQLCA ; 

FILE *fp; 

main ( ) 

/* open ascii file in append mode */ 



fp = fopen("outf ile" , "a"); 

/* give the login and password to logon to the database */ 
strcpy (uid.arr, "rsp" ) ; 
uid.len = strlen (uid. arr) ; 
strcpy (pwd. arr, "prs") ; 
pwd.len = strlen (pwd. arr) ; 

/* exit in case of an unauthorized accessor to the database */ 

EXEC SQL WHENEVER SQLERROR GOTO errexit; 

EXEC SQL CONNECT :uid IDENTIFIED BY :pwd; 
for (;;) 

{ /* infinite loop begins */ 

/* menu for selecting update, insert, and delete options */ 

printf("\n \n 1. Update a record \n") ; 

printf ("\n \n 2. Insert a record \n" ) ; 

printf ("\n \n 3. Delete a record \n") ; 

printf("\n \n Select an option 1/2/3 ? \n") ; 

choice = getche ( ) ; 

if (choice == '1') goto update; 

else if (choice == '2') goto insert; 

else if (choice ==' 3 ') goto delete; 

else { printf ("invalid selection"); 
exit (1) ; } 

update: /* label for the update option */ 

{ 

/* To ensure that the employee with the given employee number 
exists, before update could be made. */ 
code = 1; 

printf (" \n count is %d \n", count); 

askn( "Enter employee number to be updated: ", &empno) ; 

/* using the COUNT supported by oracle, the number of records 
having the desired employee number is assigned the variable count 
*/ 


EXEC SQL SELECT COUNT (EMPNO) INTO : count 
FROM EMPP 

WHERE EMPNO = :empno; 

printf ("count is %d \n", count); 

if (count == 0) 

{ printf ( "Employee with employee number %d does not exist \n", 
empno) ; 

exit (1) ; } 

/* retrieve the information from the database whose employee-number 
has been requested for, and place the contents of the fields into 
C variables for update purposes. */ 


EXEC SQL SELECT ENAME, SAL, DEPTNO , JOB 



INTO rename, :sal, : deptno, :job 
FROM EMPP 

WHERE EMPNO = :empno; 

/* displays the already existing value for employee name */ 

/* assign the new employee name if it should be updated */ 

printf ("ename is %s \n", ename.arr); 

printf ("Do you want to update ENAME: (y/n) ?") ; 

reply = getche(); 

if (reply == 'n' ) { 
ename = ename; 

printf ("\n ename is %s \n", ename.arr);} 
if (reply == 'y') { 

1 = asks("\n enter employee name : ", ename.arr); 
printf ("new ename is %s \n", ename.arr);} 

/* displays the already existing value for job name */ 

/* assign the new job if it should be updated */ 

printf ("do you want to update job-name: (y/n) ?") ; 
reply = getche(); 

if (reply == 'n' ) { 
job = job; 

printf ("\n job-name is %s \n", job.arr);} 

if (reply == 'y'){ . . . 

job.len = asks("\n enter employee's 30b job.arr); 

printf ("new job-name is %s \n", job.arr);} 

/* displays the already existing value for salary */ 

/* assign new salary if it should be updated */ 

printf ("Do you want to update salary : (y/n) " ); 
reply = getche(); 

if (reply == 'n' ) { 
sal = sal; 

printf ("\n salary is %d \n", sal);} 
if (reply == 'y' ) { 

askn("\n enter employee's salary: ", &sal) ; 
printf ("new salary is %d \n", sal);} 

/* displays the already existing value for department number */ 
/* assign the new department number if it should be updated */ 

printf ("Do you want to update deptno :(y/n)"); 
reply = getche ( ) ; 

if (reply == 'n' ) { 
deptno = deptno; 

printf ("\n deptno is %d \n", deptno);} 
if (reply == 'y ' ) { 


askn("\n Enter employee dept : " , fcdeptno); 

printf ("new deptno is %d \n", deptno) ; > 

/* update the database with the new values */ 

EXEC SQL UPDATE EMPP 

SET ENAME = : ename , SAL = :sal, DEPTNO = : deptno, JOB = :job 
WHERE EMPNO = : empno; 

printf("\n %s with employee number %d has been updated\n", 
ename . arr , empno) ; 

fprintf (fp, "%10d %ld %15s", empno, code, ename. arr) ; 
fprintf (fp, "%6d %3d %4s\n", sal, deptno, job. arr); 

printf ( "%10d %15s %6d H , empno, ename. arr, sal); 
printf ("%3d %4s\n", deptno, job. arr); 

} 

insert: /* label for insertion of record based on the employee 

number */ 

{ 

code = 2 ; 

/* To prevent insertion of a record whose primary key is the same 
as the primary key of an already existing record */ 

askn("\n Enter employee number to be inserted:", & empno ) ; 
EXEC SQL SELECT COUNT (EMPNO) INTO : count 
FROM EMPP 

WHERE EMPNO = : empno; 

printf ( "count is %d \n", count); 

if (count > 0) { 

printf ("Employee with %d employee number already exists \n", 

empno) ; 

exit (1) ; } 

else { /* obtain values for various fields to be inserted */ 
1 = asks ("Enter employee name : ", ename. arr); 
job.len = asks("Enter employee job :", job. arr); 
askn( "Enter employee salary :" , &sal) ; 
askn( "Enter employee dept number , deptno); 

/* insert the values obtained into the database */ 

EXEC SQL INSERT INTO EMPP (EMPNO, ENAME, JOB, SAL, DEPTNO) 
VALUES (:empno, :ename, :job, :sal, : deptno ) ; 

/* append the insert into the ascii file */ 

fprintf (fp, "%10d %ld %15s ", empno, code, ename. arr); 
fprintf (fp, "%6d %3d %4s \n", sal, deptno, job. arr); 



printf ("%10d %15s %6d", empno, ename.arr, sal); 
printf ( "%3d %4s", deptno, job. arr) ; } 


delete: /* label for deletion of records based on employee number 

*/ 

{ 

int code = 3 ; 

/* obtain the employee number of the employee to be deleted */ 

askn( "Enter employee number to be deleted : " , & empno ) ; 

EXEC SQL SELECT COUNT (EMPNO) INTO : count 
FROM EMPP 

WHERE EMPNO = : empno; 

if (count > 0) { /* delete record if it exists */ 

EXEC SQL DELETE FROM EMPP WHERE EMPNO = : empno; 
printf ( "Employee number %d deleted \n", empno); 

fprintf (fp, "%10d %ld\n", empno, code);} 

else { 

printf ( "Employee with number %d does not exist \n", empno); 
exit ( 1) ;} 

} 


EXEC SQL COMMIT WORK RELEASE; /* make the changes permanent */ 
printf ( " \n End of the C/ORACLE example program. \n" ) ; 
return ; 

fclose(fp) ; 

EXEC SQL WHENEVER SQLERROR CONTINUE; 

EXEC SQL ROLLBACK WORK RELEASE; /* in case of inconsistency */ 

return ; 

errexit : 

errrpt ( ) ; 


} 

} /* infinite loop ends */ 

/* function takes the text to be printed and accepts a srting 
variable from standard input and converts it into numeric - hence 
is used to obtain values for numeric fields */ 

int askn (text, variable) 
char text [ ] ; 
int *variable; 

{ 

char s[20] ; 
printf (text) ; 



fflush(stdout) ; 
if (gets(s) == (char *)0) 
return (EOF) ; 

♦variable = atoi(s); 
return ( 1 ) ; 

> 

/* function takes the text to be printed and prints it, accepts 
string values for character variables and is thus used to obtain 
values for fields of type character. It returns the length of the 
string value */ 

int asks (text , variable) 

char text [ ] , variable [ ] ; 

{ 

printf (text) ; 
f flush (stdout) ; 

return ( gets (variable) == (char *) 0 ? EOF : 

strlen (variable) ) ; 

} 

errrpt ( ) 

{ 

printf ("%. 70s ( % d ) \ n " , sqlca . sqlerrm . sqlerrmc , 

-sqlca. sqlcode) ; 
return (0) ; 

} 



